NSF PAR Search | NSF Public Access Repository

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Wang, Junxiong; Paliotta, Daniele; May, Avner; Rush, Alexander M; Dao, Tri (December 2024, NeurIPS)

Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall, we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN model. We also find that the distilled model has natural length extrapolation, showing almost perfect accuracy in the needle-in-a-haystack test at 20x the distillation length.
Full Text Available
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Schiff, Yair; Kao, Chia-Hsiang; Gokaslan, Aaron; Dao, Tri; Gu, Albert; Kuleshov, Volodymyr (July 2024, International Conference on Machine Learning)

Full Text Available
Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Dao, Tri; Fu, Daniel Y.; Saab, Khaled K.; Thomas, Armin W.; Rudra, Atri; Ré, Christopher (May 2023, Proceedings of the 11th International Conference on Learning Representations (ICLR))

Full Text Available
Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Fu, Daniel Y.; Epstein, Elliot L.; Nguyen, Eric; Thomas, Armin W.; Zhang, Michael; Dao, Tri; Rudra, Atri; Ré, Christopher (July 2023, Proceedings of the 40th International Conference on Machine Learning (ICML))

Full Text Available
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri; Ré, Christopher (December 2022, Proceedings of the 35th Neural Information Processing Systems Conference (NeurIPS))

Full Text Available
ButterflyFlow: Building Invertible Layers with Butterfly Matrices

Meng, Chenlin; Zhou, Linqi; Choi, Kristy; Dao, Tri; Ermon, Stefano (January 2022, International Conference on Machine Learning)

Full Text Available
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

Chen, Beidi; Dao, Tri; Liang, Kaizhao; Yang, Jiaming; Song, Zhao; Rudra, Atri; Ré, Christopher (January 2022, International Conference on Learning Representations (ICLR))

Full Text Available
Monarch: Expressive Structured Matrices for Efficient and Accurate Training

Dao, Tri; Chen, Beidi; Sohoni, Nimit S.; Desai, Arjun; Poli, Michael; Grogan, Jessica; Liu, Alexander; Rao, Aniruddh; Rudra, Atri; Ré, Christopher (January 2022, Proceedings of the 39th International Conference on Machine Learning)

Full Text Available
HiPPO: Recurrent Memory with Optimal Polynomial Projections

Gu, Albert; Dao, Tri; Ermon, Stefano; Rudra, Atri; Ré, Christopher (December 2020, Advances in neural information processing systems)
A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed. We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies. On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25-40% accuracy.
Full Text Available
Scatterbrain: Unifying Sparse and Low-rank Attention

Chen, Beidi; Dao, Tri; Winsor, Eric; Song, Zhao; Rudra, Atri; Ré, Christopher (January 2021, Advances in neural information processing systems)

Full Text Available
